252 research outputs found

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high-performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high-performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
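As a rough illustration of what a static asymmetric schedule involves, the sketch below splits the iterations of a loop between a fast and a slow cluster in proportion to their aggregate speed. It is not the BLIS implementation described in the abstract. The function names, the 4+4 core counts, and the `big_speedup` ratio of 3.0 are hypothetical values chosen for the example; a real ratio would have to be measured on the target kernel.

```python
def asymmetric_static_split(n_iters, big_cores=4, little_cores=4, big_speedup=3.0):
    """Split n_iters loop iterations between a big and a LITTLE cluster.

    big_speedup is a hypothetical ratio: how many times faster one big core
    runs this kernel than one LITTLE core (an assumption, to be measured).
    """
    big_weight = big_cores * big_speedup
    little_weight = little_cores * 1.0
    big_share = round(n_iters * big_weight / (big_weight + little_weight))
    little_share = n_iters - big_share
    return big_share, little_share


def chunk(n, parts):
    """Divide n iterations into `parts` nearly equal contiguous chunks."""
    base, extra = divmod(n, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


big, little = asymmetric_static_split(1024)
print(big, little)                        # iterations per cluster
print(chunk(big, 4), chunk(little, 4))    # iterations per core in each cluster
```

A dynamic schedule would instead let each core fetch the next chunk when it becomes idle, which avoids having to know the speed ratio in advance at the cost of some synchronization.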

    Ultrasonic characterization of the pulmonary venous wall: echographic and histological correlation

    Background: Pulmonary vein isolation with radiofrequency catheter ablation techniques is used to prevent recurrences of human atrial fibrillation. Visualization of the architecture at the venoatrial junction could be crucial for these ablative techniques. Our study assesses the potential for intravascular ultrasound to provide this information. Methods and Results: We retrieved 32 pulmonary veins from 8 patients who died from noncardiac causes. We obtained cross-sectional intravascular ultrasound (IVUS) images with a 3.2F, 30-MHz ultrasound catheter at intervals on each vein. Histological cross-sections at the same intervals allowed comparisons with the ultrasonic images. The pulmonary venous wall at the venoatrial junction revealed a 3-layered ultrasonic pattern. The inner echogenic layer represents both endothelium and connective tissue of the media (mean maximal thickness, 1.4±0.3 mm). The middle hypoechogenic stratum corresponds to the sleeves of left atrial myocardium surrounding the external aspect of the venous media. This layer was thickest at the venoatrial junction (mean maximal thickness, 2.6±0.8 mm) and decreased toward the lung hilum. The outer echodense layer corresponds to fibro-fatty adventitial tissue (mean maximal thickness, 2.15±0.36 mm). We found close agreement between the IVUS and histological measurements for maximal luminal diameter (mean difference, -0.12±1.3 mm) and maximal muscular thickness (mean difference, 0.17±0.13 mm) using the Bland and Altman method. Conclusions: Our experimental study demonstrates for the first time that IVUS images of the pulmonary veins can provide information on the distal limits and thickness of the myocardial sleeves and can be a valuable tool to help accurate targeting during ablative procedures.

    Integration and exploitation of intra-routine malleability in BLIS

    [EN] Malleability is a property of certain applications (or tasks) that, given an external request or autonomously, can accommodate a dynamic modification of the degree of parallelism being exploited at runtime. Malleability improves resource usage (core occupation) on modern multicore architectures for applications that exhibit irregular and divergent execution paths and depend heavily on the performance of the underlying library to attain high performance. Malleability has not yet been integrated into high-performance instances of the Basic Linear Algebra Subprograms (BLAS), and it is difficult to achieve given the rigidity of current application programming interfaces (APIs). In this paper, we overcome these issues by presenting the integration of a malleability mechanism within BLIS, a high-performance and portable framework to implement BLAS-like operations. For this purpose, we leverage low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and we demonstrate the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead.
    The researchers from Universidad Complutense de Madrid were supported by the EU (FEDER) and Spanish MINECO (TIN2015-65277-R, RTI2018-093684-B-I00), and by Spanish CM (S2018/TCS-4423). The researcher from Universitat Politècnica de València was supported by the Spanish MINECO (TIN2017-82972-R).
    Rodríguez-Sánchez, R.; Igual, FD.; Quintana-Ortí, ES. (2020). Integration and exploitation of intra-routine malleability in BLIS. The Journal of Supercomputing (Online). 76(4):2860-2875. https://doi.org/10.1007/s11227-019-03078-z

    Effect of zinc intake on growth in infants: A meta-analysis

    A systematic review and meta-analysis of available randomized controlled trials (RCTs) was conducted to evaluate the effect of zinc (Zn) intake on growth in infants. Out of 5500 studies identified through electronic searches and reference lists, 19 RCTs were selected after applying the exclusion/inclusion criteria. The influence of Zn intake on growth was considered in the overall meta-analysis. Other variables were also taken into account as possible effect modifiers: doses of Zn intake, intervention duration, nutritional status, and risk of bias. From each selected growth study, final measures of weight, length, mid upper arm circumference (MUAC), head circumference, weight for age z-score (WAZ), length for age z-score (LAZ), and weight for length z-score (WLZ) were assessed. Pooled β and 95% confidence interval (CI) were calculated. Additionally, we carried out a sensitivity analysis. Zn intake was not associated with weight, length, MUAC, head circumference, and LAZ in the pooled analyses. However, Zn intake had a positive and statistically significant effect on WAZ (β = 0.06; 95% CI 0.02 to 0.10) and WLZ (β = 0.05; 95% CI 0.01 to 0.08). The dose–response relationship between Zn intake and these parameters indicated that a doubling of Zn intake increased WAZ and WLZ by approximately 4%. Substantial heterogeneity was present only in length analyses (I² = 45%; p = 0.03). Zn intake was positively associated with length values over short intervention periods (four to 20 weeks) (β = 0.01; 95% CI 0 to 0.02) and at medium doses of Zn (4.1 to 8 mg/day) (β = 0.003; 95% CI 0 to 0.01). Nevertheless, the effect magnitude was small. Our results indicate that Zn intake increases growth parameters of infants. Nonetheless, these results should be interpreted with caution.

    Programming parallel dense matrix factorizations with look-ahead and OpenMP

    [EN] We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of the basic linear algebra subroutines (BLAS). The proposed approach is also different from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified with a high level of abstraction, the actual implementation can be easily derived from them, paving the road to deriving a high-performance implementation of a considerable fraction of linear algebra package (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.
    The researchers from Universidad Jaume I were supported by the CICYT Projects TIN2014-53495-R and TIN2017-82972-R of the MINECO and FEDER, and the H2020 EU FETHPC Project 671602 "INTERTWinE". The researchers from Universidad Complutense de Madrid were supported by the CICYT Project TIN2015-65277-R of the MINECO and FEDER. Sandra Catalán was supported during part of this time by the FPU program of the Ministerio de Educación, Cultura y Deporte. Adrián Castelló was supported by the ValI+D 2015 FPI program of the Generalitat Valenciana.
    Catalán, S.; Castelló, A.; Igual, FD.; Rodríguez-Sánchez, R.; Quintana Ortí, ES. (2020). Programming parallel dense matrix factorizations with look-ahead and OpenMP. Cluster Computing. 23(1):359-375. https://doi.org/10.1007/s10586-019-02927-z

    Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

    [EN] We introduce a high-performance, multi-threaded realization of the gemm kernel for the ARMv8.2 architecture that operates with 16-bit (half precision) floating point operands. Our code is especially designed for efficient machine learning inference (and to a certain extent, also training) with deep neural networks. The results on the NVIDIA Carmel multicore processor, which implements the ARMv8.2 architecture, show considerable performance gains for the gemm kernel, close to the theoretical peak acceleration that could be expected when moving from 32-bit arithmetic/data to 16-bit. Combined with the type of convolution operator arising in convolutional neural networks, the speed-ups are more modest though still relevant.
    This work was supported by projects TIN2017-82972-R and RTI2018-093684-B-I00 from the Ministerio de Ciencia, Innovación y Universidades, project S2018/TCS-4423 of the Comunidad de Madrid, project PR65/19-22445 of the UCM, and project Prometeo/2019/109 of the Generalitat Valenciana.
    San Juan-Sebastian, P.; Rodríguez-Sánchez, R.; Igual, FD.; Alonso-Jordá, P.; Quintana-Ortí, ES. (2021). Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors. The Journal of Supercomputing. 77(10):11257-11269. https://doi.org/10.1007/s11227-021-03636-4
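The "theoretical peak acceleration" of moving from 32-bit to 16-bit operands can be illustrated with simple lane arithmetic, together with the precision that is given up in exchange. The sketch below is only a back-of-the-envelope example and not the authors' code. The 128-bit register width and the helper names `lanes` and `to_half` are assumptions made for the illustration; the half-precision round trip uses the standard-library `struct` format code `e`.

```python
import struct

REGISTER_BITS = 128  # assumed SIMD register width for this illustration


def lanes(bits_per_element, register_bits=REGISTER_BITS):
    """Number of elements of the given width that fit in one register."""
    return register_bits // bits_per_element


def to_half(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


fp32_lanes = lanes(32)
fp16_lanes = lanes(16)
# Twice as many elements per register gives an ideal speed-up bound of 2x.
print(fp16_lanes / fp32_lanes)
# Half precision keeps roughly three significant decimal digits.
print(abs(to_half(0.1) - 0.1))
```

The factor of two is an upper bound under these assumptions; memory traffic and the surrounding operators, such as the convolutions mentioned in the abstract, can reduce the gain observed in practice.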

    A new generation of task-parallel algorithms for matrix inversion in many-threaded CPUs

    We take advantage of the new tasking features in OpenMP to propose advanced task-parallel algorithms for the inversion of dense matrices via Gauss-Jordan elimination. Our algorithms perform a partitioning of the matrix operand into two levels of tasks: The matrix is first divided vertically, by column blocks (or panels), in order to accommodate the standard partial pivoting scheme that ensures the numerical stability of the method. In addition, depending on the particular kernel to be applied, each panel is partitioned either horizontally by row blocks (tiles) or vertically by µ-panels (of columns), in order to extract sufficient task parallelism to feed a many-threaded general purpose processor (CPU). The results of the experimental evaluation show the performance benefits of the advanced tasking algorithms on an Intel Xeon Gold processor with 20 cores.
    This research was sponsored by projects RTI2018-093684-B-I00 and TIN2017-82972-R of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; and project PR65/19-22445 of Universidad Complutense de Madrid.
    Peer reviewed. Postprint (author's final draft).
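For readers unfamiliar with the underlying method, the following is a minimal sketch of matrix inversion by Gauss-Jordan elimination with partial pivoting. It is unblocked and sequential, so it shows only the numerical procedure and none of the panel, tile, or task partitioning that the abstract describes. The function name `gauss_jordan_inverse` is chosen for this example.

```python
def gauss_jordan_inverse(a):
    """Invert a square matrix (list of lists) by Gauss-Jordan elimination
    with partial pivoting. Unblocked and sequential."""
    n = len(a)
    # Augment A with the identity: [A | I].
    m = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for k in range(n):
        # Partial pivoting: pick the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Scale the pivot row so that the pivot becomes 1.
        pivot = m[k][k]
        m[k] = [v / pivot for v in m[k]]
        # Eliminate column k from every other row, above and below the pivot.
        for r in range(n):
            if r != k:
                f = m[r][k]
                if f != 0.0:
                    m[r] = [vr - f * vk for vr, vk in zip(m[r], m[k])]
    # The right half of the augmented matrix now holds the inverse.
    return [row[n:] for row in m]


inv = gauss_jordan_inverse([[4.0, 7.0], [2.0, 6.0]])
print(inv)
```

Pivot selection scans a single column, and that is why the abstract's first-level partitioning is by column panels. The elimination updates of the remaining rows are independent of one another, so they can be split into finer-grained tasks.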